5420 Anomaly Detection, Fall 2020

Assignment 4

Submitted by: Harsh Dhanuka, hd2457

Flow of Analysis/Changes made from previous submission:

Section A:

  • Added the 'population and income' dataset, which contains population and mean/median income by zipcode

Section B:

Commented out some of the EDA cells, as they were taking a lot of memory to run.

  • Most are interactive plotly graphs; hence, some are commented out.
  • They all have been run and submitted during EDA Assignment 4 submission.

Section C:

Feature Engineering - create new features such as:

  • Sum total of 'Total Discharged' grouped by the Provider Id and Provider Name
  • Ratio of Average Total Payments to Zip Median Income
  • Ratio of 'Total Discharges' to 'Zip_Population'

Section D:

  • Standardization of Numerical / Float variables
  • Normalize / one-hot-encoding of Categorical variables
  • Drop irrelevant variables
  • k-means Clustering
  • Visualizing the results
  • Explain the anomalies

I have selected an 8-cluster model, and then explain the findings
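The Section D pipeline (standardize numerics, one-hot-encode categoricals, then k-means) can be sketched as follows. This is a minimal toy version with hypothetical values, using 2 clusters instead of the 8 used in the actual analysis:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Toy stand-in for the cleaned hospital data (hypothetical values)
toy = pd.DataFrame({
    'Average_Total_Payments': [5000.0, 5200.0, 48000.0, 51000.0],
    'Total_Discharges': [20, 25, 300, 280],
    'Region': ['South', 'South', 'West', 'West'],
})

# One-hot-encode the categorical variable, then standardize everything
X = pd.get_dummies(toy, columns=['Region'])
X_scaled = StandardScaler().fit_transform(X)

# k-means clustering; the actual analysis uses n_clusters=8
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
toy['cluster'] = km.labels_
```

On this toy data the two cheap, low-volume rows land in one cluster and the two expensive, high-volume rows in the other.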

Healthcare Data

Some guiding questions:

1. Medical charges vary widely by region and type of work and other factors. Can you claim a doctor who charges high a fraud?

In my understanding, it is unwise to claim that a doctor who charges high fees is a fraud. Several other factors need to be considered before declaring a doctor a fraud. Most basically, the comparison that says a doctor is charging high must be made on the same scale; see below for the scale metrics.

2. What is a fair comparison? Do you at least consider the type of medical work (DRG) and state?

A fair comparison means comparing apples to apples, and the same type of apples at that. The comparison needs to account for the DRG or category of treatment, the location of the doctor or the hospital, and whether it is a general public hospital or a premium hospital whose base admission charges are very high regardless of the treatment.

3. With (2), what are the features/variables that you can come up with?

  • Popular DRG_Definitions by state
  • Number of Provider_Names per city
  • Average cost per procedure/treatment
  • Average Medicare Payments, converted to % of Average Total Payments
  • Medicare Payments, converted to % of Average Total Payments, State-wise
  • Out-of-pocket expense
  • Out-of-Pocket expense per discharge
  • Total treatments per state
  • Average treatments/patients by State (mean values, as grouped by State)
  • 'Median Average Total Payment'
  • 'Median Score'
  • 'Median Score Flag'
  • 'Median Score by Provider'
  • 'Provider Flag by Median Score'
  • Evaluate a potential fraud case, for median score > 2
  • Sum total of 'Total Discharged' grouped by the Provider Id and Provider Name
  • Ratio of Average Total Payments to Zip Median Income
  • Ratio of 'Total Discharges' to 'Zip_Population'
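Some of the engineered features listed above (the grouped discharge total and the two ratio features) can be sketched on toy data. The values below are hypothetical; the column names follow the renamed dataset used later in this notebook:

```python
import pandas as pd

# Hypothetical rows standing in for the hospital data
toy = pd.DataFrame({
    'Provider_Id': [1, 1, 2],
    'Provider_Name': ['A', 'A', 'B'],
    'Total_Discharges': [10, 30, 20],
    'Average_Total_Payments': [5000.0, 7000.0, 9000.0],
    'Zip_Median_Income': [40000.0, 40000.0, 60000.0],
    'Zip_Population': [10000, 10000, 20000],
})

# Sum of Total_Discharges per provider, broadcast back onto each row
toy['Provider_Total_Discharges'] = (
    toy.groupby(['Provider_Id', 'Provider_Name'])['Total_Discharges'].transform('sum'))

# Payment relative to local income, and discharges relative to local population
toy['Payment_to_Income'] = toy['Average_Total_Payments'] / toy['Zip_Median_Income']
toy['Discharges_to_Population'] = toy['Total_Discharges'] / toy['Zip_Population']
```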

There are two types of anomalies. The first is the one-off procedure that is extremely costly compared to the median cost charged by other hospitals in the same state. These instances were found by scoring each treatment, where the score indicates how many times the state median the cost of the treatment was.

The second type of anomaly is hospitals that overcharge per treatment to a lesser extent, but over the course of many treatments. These were found by averaging the above-mentioned score over all the treatments billed by the hospital; this averaged score became the hospital's score. Hospitals with scores above 2 are flagged as consistently overcharging. A check was made to ensure that these hospitals were not just located in expensive cities, as such cities tend to charge only an extra 20% to 70%.
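A minimal sketch of this two-level scoring (treatment score = cost relative to the state median for the same DRG; hospital score = mean of its treatment scores, flagged above 2), on toy data with hypothetical hospitals H1-H3:

```python
import pandas as pd

# Hypothetical hospitals in one state, one DRG
toy = pd.DataFrame({
    'Provider_Name': ['H1', 'H2', 'H3'],
    'Provider_State': ['AL', 'AL', 'AL'],
    'DRG_Definition': ['039', '039', '039'],
    'Average_Total_Payments': [5000.0, 6000.0, 20000.0],
})

# Score each treatment against the state median for the same DRG
state_median = (toy.groupby(['Provider_State', 'DRG_Definition'])
                   ['Average_Total_Payments'].transform('median'))
toy['Score'] = toy['Average_Total_Payments'] / state_median

# A hospital's score is the mean of its treatment scores; flag scores above 2
hosp_score = toy.groupby('Provider_Name')['Score'].mean()
flagged = hosp_score[hosp_score > 2].index.tolist()
```

Here H3 charges over 3x the state median and is the only hospital flagged.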

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
%matplotlib inline
import scipy
import time
import seaborn as sns
sns.set(style="whitegrid")
import warnings
warnings.filterwarnings("ignore")
import missingno as msno

from sklearn.impute import SimpleImputer
from sklearn import metrics
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

from sklearn.decomposition import PCA

from pprint import pprint
import zipcodes

import plotly
import plotly.express as px
In [2]:
# Read the data

df = pd.read_csv('/Users/harshdhanuka/Desktop/Columbia Class Matter/SEM 3/5420 Anomaly Detection/Assignment 5 Clustering/inpatientCharges.csv')
df.head()
Out[2]:
DRG Definition Provider Id Provider Name Provider Street Address Provider City Provider State Provider Zip Code Hospital Referral Region Description Total Discharges Average Covered Charges Average Total Payments Average Medicare Payments
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 $32963.07 $5777.24 $4763.73
1 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10005 MARSHALL MEDICAL CENTER SOUTH 2505 U S HIGHWAY 431 NORTH BOAZ AL 35957 AL - Birmingham 14 $15131.85 $5787.57 $4976.71
2 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10006 ELIZA COFFEE MEMORIAL HOSPITAL 205 MARENGO STREET FLORENCE AL 35631 AL - Birmingham 24 $37560.37 $5434.95 $4453.79
3 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10011 ST VINCENT'S EAST 50 MEDICAL PARK EAST DRIVE BIRMINGHAM AL 35235 AL - Birmingham 25 $13998.28 $5417.56 $4129.16
4 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10016 SHELBY BAPTIST MEDICAL CENTER 1000 FIRST STREET NORTH ALABASTER AL 35007 AL - Birmingham 18 $31633.27 $5658.33 $4851.44
In [3]:
# Rows and  Columns

print(" ")
print("Number of rows and columns in the dataset:")
df.shape
 
Number of rows and columns in the dataset:
Out[3]:
(163065, 12)
In [4]:
print(" ")
print("Basic statistics of the numerical columns are as follows:")

# Check basic statistics
df.describe()
 
Basic statistics of the numerical columns are as follows:
Out[4]:
Provider Id Provider Zip Code Total Discharges
count 163065.000000 163065.000000 163065.000000
mean 255569.865428 47938.121908 42.776304
std 151563.671767 27854.323080 51.104042
min 10001.000000 1040.000000 11.000000
25% 110092.000000 27261.000000 17.000000
50% 250007.000000 44309.000000 27.000000
75% 380075.000000 72901.000000 49.000000
max 670077.000000 99835.000000 3383.000000
In [5]:
# Check the column dtypes
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 163065 entries, 0 to 163064
Data columns (total 12 columns):
DRG Definition                          163065 non-null object
Provider Id                             163065 non-null int64
Provider Name                           163065 non-null object
Provider Street Address                 163065 non-null object
Provider City                           163065 non-null object
Provider State                          163065 non-null object
Provider Zip Code                       163065 non-null int64
Hospital Referral Region Description    163065 non-null object
 Total Discharges                       163065 non-null int64
 Average Covered Charges                163065 non-null object
 Average Total Payments                 163065 non-null object
Average Medicare Payments               163065 non-null object
dtypes: int64(3), object(9)
memory usage: 14.9+ MB

Considerations from eyeballing the data:

  1. Check for categorical/object fields from the data variable descriptions, and convert the relevant numeric fields to categorical/object types: Provider Id

  2. The Zipcode column is represented as an integer. Convert it to zipcode format.

  3. Variable Hospital Referral Region Description comprises the State and a city, which appears to be the nearest metro city.

  4. Average Covered Charges is not significant for our analysis; it would be relevant for other purposes such as claims fraud, insurance premiums, etc.

  5. The two payments columns need to be converted to proper numeric formats.

In [6]:
# Basic Sort of Provider ID and DRG Definition

df = df.sort_values(['Provider Id', 'DRG Definition'])
df.head(2)
Out[6]:
DRG Definition Provider Id Provider Name Provider Street Address Provider City Provider State Provider Zip Code Hospital Referral Region Description Total Discharges Average Covered Charges Average Total Payments Average Medicare Payments
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 $32963.07 $5777.24 $4763.73
1079 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 38 $20312.78 $4894.76 $3865.50

A. Initial Data Cleaning

1. Rename all column names

The given column names have a lot of spaces, trailing spaces, etc. I will rename all the columns following an appropriate naming convention.

In [7]:
df.columns = ['DRG_Definition', 'Provider_Id', 'Provider_Name',
       'Provider_Street_Address', 'Provider_City', 'Provider_State',
       'Provider_Zip_Code', 'Hospital_Referral_Region_Description',
       'Total_Discharges', 'Average_Covered_Charges',
       'Average_Total_Payments', 'Average_Medicare_Payments']
df.columns
Out[7]:
Index(['DRG_Definition', 'Provider_Id', 'Provider_Name',
       'Provider_Street_Address', 'Provider_City', 'Provider_State',
       'Provider_Zip_Code', 'Hospital_Referral_Region_Description',
       'Total_Discharges', 'Average_Covered_Charges', 'Average_Total_Payments',
       'Average_Medicare_Payments'],
      dtype='object')

2. Drop irrelevant columns

As discussed above, the Average_Covered_Charges is not directly relevant for the fraud detection purpose. The drop is left commented out below, so the column is retained for now.

In [8]:
# df = df.drop(columns = ['Average_Covered_Charges'])

3. Convert amount variables to appropriate numeric format, strip off the dollar symbol

In [9]:
df["Average_Total_Payments"] = df["Average_Total_Payments"].str[1:].astype(float)
df["Average_Medicare_Payments"] = df["Average_Medicare_Payments"].str[1:].astype(float)
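The slicing above assumes every value starts with '$' and contains no thousands separators. A slightly more defensive conversion (a sketch on hypothetical values; not required for this dataset) would strip both characters before converting:

```python
import pandas as pd

# Hypothetical amount strings, one with a thousands separator
s = pd.Series(['$32963.07', '$5,777.24'])

# Remove '$' and ',' then convert to numeric
cleaned = pd.to_numeric(s.str.replace(r'[$,]', '', regex=True))
```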

4. Convert Provider_Id to appropriate object format

In [10]:
df["Provider_Id"] = df["Provider_Id"].astype(object)
In [11]:
df.head(2)
Out[11]:
DRG_Definition Provider_Id Provider_Name Provider_Street_Address Provider_City Provider_State Provider_Zip_Code Hospital_Referral_Region_Description Total_Discharges Average_Covered_Charges Average_Total_Payments Average_Medicare_Payments
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 $32963.07 5777.24 4763.73
1079 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 38 $20312.78 4894.76 3865.50

5. Convert Provider_Zip_Code from integer to an appropriate zipcode format

In [12]:
# Zipcode to 5-character, zero-padded string zipcode format

df['Provider_Zip_Code'] = df['Provider_Zip_Code'].astype(str).str.zfill(5)

6. Check NA's

In [13]:
df.isnull().sum().sum()
Out[13]:
0

There are no NA's, which is good for our analysis.

7. Add a new "states" dataset to match 'Regions' with the existing Provider_State variable.

Regions will be a very useful feature when performing the Exploratory Data Analysis.

In [14]:
states = pd.read_csv('/Users/harshdhanuka/Desktop/Columbia Class Matter/SEM 3/5420 Anomaly Detection/Assignment 4 EDA/states.csv')
states.head(5)
Out[14]:
State State Code Region Division
0 Alaska AK West Pacific
1 Alabama AL South East South Central
2 Arkansas AR South West South Central
3 Arizona AZ West Mountain
4 California CA West Pacific
In [15]:
# Left join the new dataset

df = pd.merge(left = df, right = states, left_on = 'Provider_State', right_on = 'State Code', how = 'left')
df.head(2)
Out[15]:
DRG_Definition Provider_Id Provider_Name Provider_Street_Address Provider_City Provider_State Provider_Zip_Code Hospital_Referral_Region_Description Total_Discharges Average_Covered_Charges Average_Total_Payments Average_Medicare_Payments State State Code Region Division
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 $32963.07 5777.24 4763.73 Alabama AL South East South Central
1 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 38 $20312.78 4894.76 3865.50 Alabama AL South East South Central
In [16]:
# Remove duplicate 'state' column

df = df.drop(columns = ['State', 'State Code'])

8. Add a 'Median Income' by zipcode dataset, to match by zipcode in the original dataset

Dataset source: https://www.psc.isr.umich.edu/dis/census/Features/tract2zip/

This has zipcode-wise mean and median income data for 2006 to 2010

In [17]:
income_df = pd.read_excel('/Users/harshdhanuka/Desktop/Columbia Class Matter/SEM 3/5420 Anomaly Detection/Assignment 5 Clustering/MedianZIP-3.xlsx')
income_df['Zip'] = income_df['Zip'].astype(str).str.zfill(5)
income_df.head(3)
Out[17]:
Zip Median Mean Pop
0 01001 56662.5735 66687.8 16445
1 01002 49853.4177 75062.6 28069
2 01003 28462.0000 35121 8491
In [18]:
income_df.isnull().sum()
Out[18]:
Zip       0
Median    0
Mean      0
Pop       0
dtype: int64
In [19]:
df = pd.merge(left = df, right = income_df, left_on = 'Provider_Zip_Code', right_on = 'Zip', how = 'left')
df = df.drop(columns = ['Zip','Mean'])

df.rename(columns={'Median':'Zip_Median_Income',
                   'Pop':'Zip_Population'}, inplace=True)

df.head(2)
Out[19]:
DRG_Definition Provider_Id Provider_Name Provider_Street_Address Provider_City Provider_State Provider_Zip_Code Hospital_Referral_Region_Description Total_Discharges Average_Covered_Charges Average_Total_Payments Average_Medicare_Payments Region Division Zip_Median_Income Zip_Population
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 $32963.07 5777.24 4763.73 South East South Central 38007.8711 35759.0
1 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 38 $20312.78 4894.76 3865.50 South East South Central 38007.8711 35759.0

.

-------------------------------------- SECTION BREAK ------------------------------------

.

B. Exploratory Data Analysis

Showing the Distribution of X

1. DRG_Definition Distribution

Explore the total number of DRG Definitions, and the count of how many times they appear.

In [20]:
df_count = df['DRG_Definition'].value_counts()
df_count = pd.DataFrame(df_count).reset_index()
df_count.columns = ['DRG_Definition','Count']
df_count.head()
Out[20]:
DRG_Definition Count
0 194 - SIMPLE PNEUMONIA & PLEURISY W CC 3023
1 690 - KIDNEY & URINARY TRACT INFECTIONS W/O MCC 2989
2 292 - HEART FAILURE & SHOCK W CC 2953
3 392 - ESOPHAGITIS, GASTROENT & MISC DIGEST DIS... 2950
4 641 - MISC DISORDERS OF NUTRITION,METABOLISM,F... 2899
In [21]:
fig = px.bar(df_count, x = 'DRG_Definition', y = 'Count', color = 'DRG_Definition',
             width=1450, height=500,
            title = "Distribution of DRG Definitions")
fig.show()

Observation:

The DRG Definitions have a seemingly good distribution. The counts of DRG Definitions range from around 600 to 3000; all DRG Definition counts lie within this range.

Here, no DRG Definition count seems like an outlier, and all behave normally.

2. Provider_Name Distribution

Explore the total number of Provider Names and the count of how many times each one appears.

In [22]:
df_count = df['Provider_Name'].value_counts()
df_count = pd.DataFrame(df_count).reset_index()
df_count.columns = ['Provider_Name','Count']
df_count.head()
Out[22]:
Provider_Name Count
0 GOOD SAMARITAN HOSPITAL 633
1 ST JOSEPH MEDICAL CENTER 427
2 MERCY MEDICAL CENTER 357
3 MERCY HOSPITAL 347
4 ST JOSEPH HOSPITAL 343
In [23]:
# Show only those Provider_Names whose total count is above 100

df_count1 = df_count.loc[df_count['Count'] > 100]
fig = px.bar(df_count1, x='Provider_Name', y='Count',
             width=1200, height=500,
             color = 'Provider_Name',
            title = "Distribution of Provider Names, only showing for Count > 100")
fig.show()
In [24]:
# Show only those Provider_Names whose total count is below 3

df_count1 = df_count.loc[df_count['Count'] < 3]
fig = px.bar(df_count1, x='Provider_Name', y='Count',
             width=1200, height=500,
            color = 'Provider_Name',
            title = "Distribution of Provider Names, only showing for Count < 3")
# fig.show()

Observation:

From the above two count charts, it is clear that some Providers are extremely popular, with around 600 entries. They seem to be the ones who provide services under multiple DRG Definitions.

Meanwhile, some Providers are very unpopular, with only 1 entry. This can depend on the DRG Definition: some hospitals may be single-specialty hospitals, so everyone with that condition goes there.

3. Provider_City Distribution

Explore the total number of cities and the count of how many times each one appears.

In [25]:
df_count = df['Provider_City'].value_counts()
df_count = pd.DataFrame(df_count).reset_index()
df_count.columns = ['Provider_City','Count']
df_count.head()
Out[25]:
Provider_City Count
0 CHICAGO 1505
1 BALTIMORE 1059
2 HOUSTON 950
3 PHILADELPHIA 898
4 BROOKLYN 877
In [26]:
# Show only those Provider_Cities whose total count is above 500

df_count1 = df_count.loc[df_count['Count'] > 500]
fig = px.bar(df_count1, x='Provider_City', y='Count',width=1000, height=500,
            color = 'Provider_City',
            title = "Distribution of Provider Cities, only showing for Count >500")
# fig.show()
In [27]:
# Show only those Provider_Cities whose total count is below 5

df_count1 = df_count.loc[df_count['Count'] < 5]
fig = px.bar(df_count1, x='Provider_City', y='Count',width=1000, height=500,
            color = 'Provider_City',
            title = "Distribution of Provider Cities, only showing for Count < 5")
# fig.show()

Observation:

Chicago is the most popular city with around 1500 entries. There are also a lot of other cities which have only 1 entry.

4. Provider_State Distribution

Explore the total number of states and the count of how many times each one appears.

In [28]:
df_count = df['Provider_State'].value_counts()
df_count = pd.DataFrame(df_count).reset_index()
df_count.columns = ['Provider_State','Count']
df_count.head()
Out[28]:
Provider_State Count
0 CA 13064
1 TX 11864
2 FL 11155
3 NY 9178
4 IL 7909
In [29]:
fig = px.bar(df_count, x='Provider_State', y='Count',
             width=1000, height=500,
             color = 'Provider_State',
            title = "Distribution of Provider State")
# fig.show()

Observation:

The states seem to have a good distribution. There seem to be no outliers or states requiring special attention.

5. Average_Total_Payments Distribution

In [30]:
fig = px.histogram(df, x="Average_Total_Payments",
            width=1000, height=500,
            title = "Distribution of Average Total Payments")
# fig.show()
In [31]:
fig = px.box(df, x="Average_Total_Payments",width=1000, height=500,
            title = "Distribution of Average Total Payments")
fig.show()
In [32]:
fig = px.violin(df, y="Average_Total_Payments", box=True, 
                points='all',width=800, height=500,
               title = "Distribution of Average Total Payments")
# fig.show()

Observation:

Most of the Average Total Payments are less than USD 11,000, so any average payment above that might warrant further investigation.

There are some extremely high values, above USD 150,000, which may need investigation.

  • Median: 7214.1

6. Average_Medicare_Payments Distribution

In [33]:
fig = px.histogram(df, x="Average_Medicare_Payments",
            width=1000, height=500,
                  title = "Distribution of Average Medicare Payments")
# fig.show()
In [34]:
fig = px.box(df, x="Average_Medicare_Payments",width=1000, height=500,
            title = "Distribution of Average Medicare Payments")
# fig.show()

Observation:

Most of the Average Medicare Payments are less than USD 10,000, so any average payment above that might warrant further investigation.

There are some extremely high values, above USD 150,000, which may need investigation.

  • Median: 6158.46

7. Total_Discharges Distribution

In [35]:
fig = px.histogram(df, x="Total_Discharges",
            width=800, height=500,
                  title = "Distribution of Total Discharges")
# fig.show()
In [36]:
fig = px.box(df, x="Total_Discharges",width=1000, height=500,
            title = "Distribution of Total Discharges")
# fig.show()

Observation:

Most of the Total Discharges are less than 49.

There are some extremely high values, such as 3383, which may need investigation.

  • Median: 27

Showing the Distribution of Y by another Categorical Variable X

8. Average_Total_Payments by DRG_Definition

In [37]:
fig = px.box(df, x="DRG_Definition", y="Average_Total_Payments",width=1400, height=500,
             color = "DRG_Definition",
            title = "The Distribution of Average Total Payments by DRG Definition")
fig.show()

Observation:

Some DRG's have very high Average Total Payments; these may be critical operations that bear high cost.

9. Average_Total_Payments by State

In [38]:
fig = px.box(df, x="Provider_State", y="Average_Total_Payments",width=1000, height=500,
             color = "Provider_State",
            title = "The Distribution of Average Total Payments by Provider State")
fig.show()

Observation:

The Average Total Payments are more or less similar across states, but some states such as NY and CA are very expensive overall.

10. Average_Total_Payments by Region

In [39]:
fig = px.box(df, x="Region", y="Average_Total_Payments",width=1000, height=500,
             color = "Region",
            title = "The Distribution of Average Total Payments by Region")
# fig.show()
In [40]:
# px.violin(df,x='Average_Total_Payments', y = "Region", color='Region',
#          title = "The Distribution of Average Total Payments by Region",
#          orientation='h').update_traces(side='positive',width=2)

Observation:

The West region seems to be generally high in terms of Average Total Payments. This was verified earlier, as we saw that the state CA was extremely high as well.

It is followed by Northeast, which includes the state NY.

11. Total_Discharges by DRG_Definition

In [41]:
fig = px.box(df, x="DRG_Definition", y="Total_Discharges",width=1400, height=500,
             color = "DRG_Definition",
            title = "The Distribution of Total Discharges by DRG Definition")
# fig.show()

Observation:

The discharge rate for some DRG's is very high, while most others have a balanced discharge rate.

12. Total_Discharges by Region

In [42]:
fig = px.box(df, x="Region", y="Total_Discharges",width=1000, height=500,
             color = "Region",
            title = "The Distribution of Total Discharges by Region")
# fig.show()
In [43]:
# px.violin(df,x='Total_Discharges', y = "Region", color='Region',
#          title = "The Distribution of Total Discharges by Region",
#          orientation='h').update_traces(side='positive',width=2)

Observation:

Most regions have a similar Total Discharges pattern. However, the Northeast region has an outlier.

Showing interaction of two or three variables

13. See interaction between Average_Total_Payments and Average_Medicare_Payments

In [44]:
fig = px.scatter(df, x="Average_Total_Payments", y="Average_Medicare_Payments",
                 size = "Average_Total_Payments", color = 'Average_Total_Payments',
                size_max=60,width=800, height=600)
# fig.show()

Observation:

As the Average Total Payments increase, the Average Medicare Payments also increase, which shows that there is very high collinearity between these two variables.

14. Understand features such as mean, median, min, max, etc of Average_Total_Payments

I will group the entire data by DRG_Definition and then calculate the statistics for each group overall.

In [45]:
agg_columns = ['mean', 'median', 'var', 'std', 'count', 'min', 'max']
groupby_drg = df[['DRG_Definition', 'Average_Total_Payments']].groupby(by='DRG_Definition').agg(agg_columns)

# Flatten the MultiIndex columns into 'Average_Total_Payments-<agg>' names
groupby_drg.columns = [header + '-' + agg_column 
                       for header, agg_column in zip(groupby_drg.columns.get_level_values(0), agg_columns)]
In [46]:
groupby_drg.reset_index(inplace=True)
groupby_drg['Average_Total_Payments-range'] = groupby_drg['Average_Total_Payments-max'] - groupby_drg['Average_Total_Payments-min']

groupby_drg.head(2)
Out[46]:
DRG_Definition Average_Total_Payments-mean Average_Total_Payments-median Average_Total_Payments-var Average_Total_Payments-std Average_Total_Payments-count Average_Total_Payments-min Average_Total_Payments-max Average_Total_Payments-range
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 6960.534004 6582.89 2.184111e+06 1477.873952 1079 4968.00 18420.56 13452.56
1 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... 6706.276445 6093.75 4.137017e+06 2033.965862 1201 4194.09 25519.43 21325.34
In [47]:
def plt_setup(_plt):
    _plt.tick_params(
    axis='x',            # changes apply to the x-axis
    which='both',        # both major and minor ticks are affected
    bottom=False,        # ticks along the bottom edge are off
    top=False,           # ticks along the top edge are off
    labelbottom=False)   # labels along the bottom edge are off

See the 'mean' of the respective DRG_Definition groups

In [48]:
plt.figure(figsize=(20,8))
sns.barplot(x='DRG_Definition', y='Average_Total_Payments-mean', 
            data=groupby_drg.sort_values('Average_Total_Payments-mean'))
plt_setup(plt)
plt.title('Mean Average Total Payments by DRG', fontsize=16)
plt.ylabel('Mean of Average Total Payments', fontsize=16)
Out[48]:
Text(0, 0.5, 'Mean of Average Total Payments')

Observation:

Some DRG groups have a very high mean, implying that they generally charge very high amounts for treatment in terms of 'Total Payments'.

15. Sum of Average Total Payments by DRG_Definition

In [49]:
pyt_by_drg = df.groupby('DRG_Definition').sum().reset_index()
In [50]:
pyt_by_drg = pyt_by_drg.sort_values('Average_Total_Payments', ascending=False)
pyt_by_drg.head()
Out[50]:
DRG_Definition Total_Discharges Average_Total_Payments Average_Medicare_Payments Zip_Median_Income Zip_Population
52 329 - MAJOR SMALL & LARGE BOWEL PROCEDURES W MCC 37250 55742017.18 52336384.24 7.227002e+07 40066710.0
91 853 - INFECTIOUS & PARASITIC DISEASES W O.R. P... 39482 55474762.93 52038586.16 6.592058e+07 38370365.0
21 207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATO... 26412 44878915.12 41902364.87 5.382135e+07 32653342.0
92 870 - SEPTICEMIA OR SEVERE SEPSIS W MV 96+ HOURS 22624 41559656.85 39343567.52 4.464413e+07 26682375.0
68 470 - MAJOR JOINT REPLACEMENT OR REATTACHMENT ... 427207 40059055.34 34277714.96 1.359685e+08 73137804.0
In [51]:
# Extract only rows with amount > 40,000,000

pyt_by_drg = pyt_by_drg.loc[pyt_by_drg['Average_Total_Payments'] > 40000000]
In [52]:
# plt.figure(figsize=(20,4))
# fig = sns.barplot(x='DRG_Definition', y='Average_Total_Payments', 
#             data=pyt_by_drg)
# fig.set_xticklabels(fig.get_xticklabels(), rotation=15)

# plt.title('Mean Average Total Payments by DRG, for total > 40,000,000', fontsize=16)
# plt.ylabel('Mean of Average Total Payments', fontsize=16)

Observation:

The DRG 329 is the highest in terms of the total sum of Average Total Payments.

16. Unique ids, unique names, and unique cities for Providers

In [53]:
unique_ids = len(df.groupby('Provider_Id').count())
unique_providers = len(df.groupby('Provider_Name').count())
unique_cities = len(df.groupby('Provider_City').count())
unique_states = len(df.groupby('Provider_State').count())

print(" ")
print(f'There are {unique_ids} unique provider id values in the data, and {unique_providers} unique provider names in a total of {unique_cities} unique cities, and {unique_states} states.')
print(" ")
 
There are 3337 unique provider id values in the data, and 3201 unique provider names in a total of 1977 unique cities, and 51 states.
 

17. Check correlations between the three numerical/integer variables, region wise

In [54]:
fig = sns.pairplot(df[['Region', 'Total_Discharges', 'Average_Total_Payments','Average_Medicare_Payments']], 
             hue= 'Region')
fig
Out[54]:
<seaborn.axisgrid.PairGrid at 0x7fa7682c39d0>
In [55]:
corr = df[['Total_Discharges', 'Average_Total_Payments', 'Average_Medicare_Payments']].corr()
f,ax = plt.subplots(figsize=(7,5))
sns.heatmap(corr, annot=True, cmap='Reds', linewidths=.4, fmt= '.1f',ax=ax)
plt.show()

Observation:

From the above graphs, some variables are highly correlated, such as Average Total Payments and Average Medicare Payments. Average Total Payments has a long-tail distribution, which could indicate potential fraud.

From the correlation matrix: Total Payments are correlated with Medicare Payments.

We can conclude that those variables are indeed related; for modeling purposes, it makes more sense to include only one or two of the three variables.
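One way to act on such correlation findings, sketched here on toy data with hypothetical values, is to drop any variable whose absolute correlation with an earlier variable exceeds a threshold:

```python
import numpy as np
import pandas as pd

# Toy frame: Medicare payments exactly proportional to total payments
toy = pd.DataFrame({
    'Total_Discharges': [10, 40, 15, 33],
    'Average_Total_Payments': [5000.0, 6000.0, 7000.0, 8000.0],
    'Average_Medicare_Payments': [4500.0, 5400.0, 6300.0, 7200.0],
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Drop any column correlated above 0.9 with an earlier column
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
reduced = toy.drop(columns=to_drop)
```

On this toy data only the Medicare column is dropped, since it is (by construction) perfectly correlated with total payments.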

18. State wise-distribution of the three numerical variables

In [56]:
plt.figure(figsize=(20,20))
g = sns.PairGrid(df,
                 x_vars = ['Total_Discharges', 'Average_Total_Payments', 'Average_Medicare_Payments'], 
                 y_vars = ['Provider_State'],
                 height=10, aspect=.25)

# Draw plot
g.map(sns.stripplot, size=10, orient="h",
      palette="ch:s=1,r=-.1,h=1_r", linewidth=1.10, edgecolor="w")

# Use the same x axis limits on all columns and add better labels
g.set(xlabel="", ylabel="")

titles = ["Total Discharges", "Average Total Payments",
          "Average Medicare Paymens"]

for ax, title in zip(g.axes.flat, titles):

    # Set a different title for each axes
    ax.set(title = title)

    # Make the grid horizontal instead of vertical
    ax.xaxis.grid(False)
    ax.yaxis.grid(True)

# plt.tight_layout()
# plt.show()
<Figure size 1440x1440 with 0 Axes>

19. Region wise-distribution of the three numerical variables

In [57]:
plt.figure(figsize=(20,20))
g = sns.PairGrid(df,
                 x_vars = ['Total_Discharges', 'Average_Total_Payments', 'Average_Medicare_Payments'], 
                 y_vars = ['Region'],
                 height=10, aspect=.25)

# Draw plot
g.map(sns.stripplot, size=10, orient="h",
      palette="ch:s=1,r=-.1,h=1_r", linewidth=1.10, edgecolor="w")

# Use the same x axis limits on all columns and add better labels
g.set(xlabel="", ylabel="")

titles = ["Total Discharges", "Average Total Payments",
          "Average Medicare Paymens"]

for ax, title in zip(g.axes.flat, titles):

    # Set a different title for each axes
    ax.set(title = title)

    # Make the grid horizontal instead of vertical
    ax.xaxis.grid(False)
    ax.yaxis.grid(True)

# plt.tight_layout()
# plt.show()
<Figure size 1440x1440 with 0 Axes>

.

-------------------------------------- SECTION BREAK --------------------------------------

.

C. Feature Engineering

Feature 1

By understanding the top DRG's per state, we establish a baseline for what people in the state normally get treated for. This information is useful in one of two ways:

  • Can be used to flag non-common and expensive procedures
  • Focus fraud detection on the common procedures.

This assumes that fraudulent users would try to get treated for the same conditions as the general population.

In [58]:
drg_by_state = df.groupby(['Provider_State', 'DRG_Definition']).agg({'DRG_Definition': 'count'})
drg_by_state.head()
Out[58]:
DRG_Definition
Provider_State DRG_Definition
AK 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 1
057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/O MCC 1
064 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W MCC 2
065 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W CC 6
066 - INTRACRANIAL HEMORRHAGE OR CEREBRAL INFARCTION W/O CC/MCC 4
In [59]:
drg_by_state.tail()
Out[59]:
DRG_Definition
Provider_State DRG_Definition
WY 872 - SEPTICEMIA OR SEVERE SEPSIS W/O MV 96+ HOURS W/O MCC 2
897 - ALCOHOL/DRUG ABUSE OR DEPENDENCE W/O REHABILITATION THERAPY W/O MCC 1
917 - POISONING & TOXIC EFFECTS OF DRUGS W MCC 1
918 - POISONING & TOXIC EFFECTS OF DRUGS W/O MCC 2
948 - SIGNS & SYMPTOMS W/O MCC 3

Note:

I am not merging this with the original dataset; it is a new reference table for checking the most common DRG's by state.

Feature 2

Number of Provider_Names per city

This information can be used in one of two ways:

  • Assuming that the cities with the most providers have a higher probability of being victims of fraud, it allows us to focus on the cities with the highest concentration.
  • Assuming that the cities with the lowest provider-per-population density are higher risk. The reasoning is that fraudsters there have fewer options to cheat and are forced to choose from a select few.
In [60]:
providers_per_city = df.groupby(['Provider_City']).agg({'Provider_Name':'count'})
providers_per_city.head()
Out[60]:
Provider_Name
Provider_City
ABBEVILLE 18
ABERDEEN 107
ABILENE 152
ABINGDON 63
ABINGTON 99
In [61]:
fig = plt.figure(figsize=(10,7))
providers_per_city.sort_values(by = 'Provider_Name', ascending = False).head()
sns.distplot(providers_per_city)
plt.tight_layout()

Note:

I am not merging this with the original dataset; it is a new reference table for checking which cities have the most providers.
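The second use mentioned under Feature 2, provider-per-population density, is not computed above. A minimal sketch on toy numbers, assuming a hypothetical `city_population` lookup (the real analysis would need a census source for city populations):

```python
import pandas as pd

# Toy provider counts per city (stands in for `providers_per_city` above).
providers_per_city = pd.DataFrame(
    {'Provider_Name': [18, 107]},
    index=pd.Index(['ABBEVILLE', 'ABERDEEN'], name='Provider_City'))

# Hypothetical city population lookup.
city_population = pd.Series({'ABBEVILLE': 27000, 'ABERDEEN': 26000}, name='Population')

# Providers per 10,000 residents: low values suggest patients have few options.
per_capita = providers_per_city.join(city_population)
per_capita['Providers_per_10k'] = (per_capita['Provider_Name']
                                   / per_capita['Population'] * 10_000)
print(per_capita.sort_values('Providers_per_10k').head())
```

Sorting ascending surfaces the low-density cities that the second assumption treats as higher risk.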

Feature 3

Average cost per procedure/treatment

Assuming that people/hospitals would likely cheat on the most expensive procedures, this feature would allow us to focus on the ones that would have the highest likelihood.

In [62]:
df['Average_Cost_Per_Procedure'] = df['Average_Total_Payments']/df['Total_Discharges']
df.head(2)
Out[62]:
DRG_Definition Provider_Id Provider_Name Provider_Street_Address Provider_City Provider_State Provider_Zip_Code Hospital_Referral_Region_Description Total_Discharges Average_Covered_Charges Average_Total_Payments Average_Medicare_Payments Region Division Zip_Median_Income Zip_Population Average_Cost_Per_Procedure
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 $32963.07 5777.24 4763.73 South East South Central 38007.8711 35759.0 63.486154
1 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 38 $20312.78 4894.76 3865.50 South East South Central 38007.8711 35759.0 128.809474
In [63]:
fig = plt.figure(figsize=(15,7))
plt.boxplot(df['Average_Cost_Per_Procedure'], vert=False)
plt.title('Average Cost Per Procedure')
plt.xlabel("Cost in $")
plt.tight_layout()

Feature 4

Average Medicare Payments, converted to % of Average Total Payments

The Medicare percentage paid varies by state, hospital, and procedure. This feature will allow us to determine which hospitals, treatments, and procedures are viewed favorably by Medicare.

In [64]:
df['Medicare_%_Paid'] = df['Average_Medicare_Payments']/df['Average_Total_Payments']
df.head(2)
Out[64]:
DRG_Definition Provider_Id Provider_Name Provider_Street_Address Provider_City Provider_State Provider_Zip_Code Hospital_Referral_Region_Description Total_Discharges Average_Covered_Charges Average_Total_Payments Average_Medicare_Payments Region Division Zip_Median_Income Zip_Population Average_Cost_Per_Procedure Medicare_%_Paid
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 $32963.07 5777.24 4763.73 South East South Central 38007.8711 35759.0 63.486154 0.824568
1 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 38 $20312.78 4894.76 3865.50 South East South Central 38007.8711 35759.0 128.809474 0.789722
In [65]:
fig = plt.figure(figsize=(15,7))
sns.boxplot(df['Medicare_%_Paid'])
plt.title('Percentage of Average Total Payments Paid by Medicare (Average)')
plt.tight_layout()

Observation:

Most average medicare payments are around 80-90% of the average total payments.

Feature 5

Medicare Payments, converted to % of Average Total Payments, State-wise

I am trying to understand whether any states have Medicare payout averages of 99% or greater. The hypothesis is that those states are more attractive for fraud.

In [66]:
medicare_pct_state = df.groupby('Provider_State').agg({'Medicare_%_Paid': 'mean'}).reset_index()
medicare_pct_state.head(5)
Out[66]:
Provider_State Medicare_%_Paid
0 AK 0.871982
1 AL 0.816622
2 AR 0.834876
3 AZ 0.842718
4 CA 0.885084
In [67]:
df = pd.merge(left = df, right = medicare_pct_state, left_on = 'Provider_State', right_on = 'Provider_State',
              how = 'left')

df.rename(columns = {'Medicare_%_Paid_x':'Medicare_%_Paid',
                     'Medicare_%_Paid_y':'Medicare_%_Paid_State'}, inplace=True)

df.head(2)
Out[67]:
DRG_Definition Provider_Id Provider_Name Provider_Street_Address Provider_City Provider_State Provider_Zip_Code Hospital_Referral_Region_Description Total_Discharges Average_Covered_Charges Average_Total_Payments Average_Medicare_Payments Region Division Zip_Median_Income Zip_Population Average_Cost_Per_Procedure Medicare_%_Paid Medicare_%_Paid_State
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 $32963.07 5777.24 4763.73 South East South Central 38007.8711 35759.0 63.486154 0.824568 0.816622
1 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 38 $20312.78 4894.76 3865.50 South East South Central 38007.8711 35759.0 128.809474 0.789722 0.816622
In [68]:
fig = px.scatter(df, x="Provider_State", y="Medicare_%_Paid_State", width=1000, height=500,
             color = "Provider_State",
            title = "Medicare % Paid Distribution (by State)")
fig.show()
In [69]:
fig = plt.figure(figsize=(15,7))
sns.distplot(df['Medicare_%_Paid_State'])
plt.title('Medicare % Paid Distribution (by State)')
plt.tight_layout()

Feature 6

Out-of-pocket expense, difference between 'Average_Total_Payments' & 'Average_Medicare_Payments'

Out-of-pocket cost highlights the procedures/treatments that are most expensive for patients. The hypothesis is that the procedures with the highest out-of-pocket cost are the least likely to be targets for fraud.

In [70]:
df['Out_of_Pocket_Payment'] = df['Average_Total_Payments'] - df['Average_Medicare_Payments']
df.head(2)
Out[70]:
DRG_Definition Provider_Id Provider_Name Provider_Street_Address Provider_City Provider_State Provider_Zip_Code Hospital_Referral_Region_Description Total_Discharges Average_Covered_Charges Average_Total_Payments Average_Medicare_Payments Region Division Zip_Median_Income Zip_Population Average_Cost_Per_Procedure Medicare_%_Paid Medicare_%_Paid_State Out_of_Pocket_Payment
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 $32963.07 5777.24 4763.73 South East South Central 38007.8711 35759.0 63.486154 0.824568 0.816622 1013.51
1 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 38 $20312.78 4894.76 3865.50 South East South Central 38007.8711 35759.0 128.809474 0.789722 0.816622 1029.26
In [71]:
sorted_avg_out_of_pocket = df.groupby(['DRG_Definition']).agg({'Out_of_Pocket_Payment': 'mean'})
sorted_avg_out_of_pocket.sort_values(by = 'Out_of_Pocket_Payment',ascending=False).head()
Out[71]:
Out_of_Pocket_Payment
DRG_Definition
460 - SPINAL FUSION EXCEPT CERVICAL W/O MCC 3735.070150
473 - CERVICAL SPINAL FUSION W/O CC/MCC 2594.714232
247 - PERC CARDIOVASC PROC W DRUG-ELUTING STENT W/O MCC 2582.521719
207 - RESPIRATORY SYSTEM DIAGNOSIS W VENTILATOR SUPPORT 96+ HOURS 2559.372528
853 - INFECTIOUS & PARASITIC DISEASES W O.R. PROCEDURE W MCC 2497.221490
In [72]:
fig = plt.figure(figsize=(15,7))
sns.distplot(df['Out_of_Pocket_Payment'])
plt.title('Out-of-Pocket Payment Distribution')
plt.tight_layout()

Feature 7

Out-of-Pocket expense per discharge

Hospitals with high out-of-pocket costs can be useful for narrowing down the ones unattractive to fraudsters.

In [73]:
df['Out_of_Pocket_per_discharge'] = df['Out_of_Pocket_Payment']/df['Total_Discharges']
df.head(2)
Out[73]:
DRG_Definition Provider_Id Provider_Name Provider_Street_Address Provider_City Provider_State Provider_Zip_Code Hospital_Referral_Region_Description Total_Discharges Average_Covered_Charges ... Average_Medicare_Payments Region Division Zip_Median_Income Zip_Population Average_Cost_Per_Procedure Medicare_%_Paid Medicare_%_Paid_State Out_of_Pocket_Payment Out_of_Pocket_per_discharge
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 $32963.07 ... 4763.73 South East South Central 38007.8711 35759.0 63.486154 0.824568 0.816622 1013.51 11.137473
1 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 38 $20312.78 ... 3865.50 South East South Central 38007.8711 35759.0 128.809474 0.789722 0.816622 1029.26 27.085789

2 rows × 21 columns

In [74]:
fig = plt.figure(figsize=(15,7))
sns.distplot(df['Out_of_Pocket_per_discharge'])
plt.tight_layout()

Feature 8

Total treatments per state

The nominal number of treatments per state can be a misleading figure, given that the states with the most people will bubble to the top. This feature becomes useful when compared against population.

In [75]:
patients_states = df['Provider_State'].value_counts()
patients_states = pd.DataFrame(patients_states).reset_index()
patients_states.columns = ['Provider_State','Count']
patients_states.head()
Out[75]:
Provider_State Count
0 CA 13064
1 TX 11864
2 FL 11155
3 NY 9178
4 IL 7909
In [76]:
df = pd.merge(left = df, right = patients_states, left_on = 'Provider_State', right_on = 'Provider_State',
              how = 'left')

df.rename(columns = {'Count':'State_Total'}, inplace=True)

df.head(2)
Out[76]:
DRG_Definition Provider_Id Provider_Name Provider_Street_Address Provider_City Provider_State Provider_Zip_Code Hospital_Referral_Region_Description Total_Discharges Average_Covered_Charges ... Region Division Zip_Median_Income Zip_Population Average_Cost_Per_Procedure Medicare_%_Paid Medicare_%_Paid_State Out_of_Pocket_Payment Out_of_Pocket_per_discharge State_Total
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 $32963.07 ... South East South Central 38007.8711 35759.0 63.486154 0.824568 0.816622 1013.51 11.137473 3635
1 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 38 $20312.78 ... South East South Central 38007.8711 35759.0 128.809474 0.789722 0.816622 1029.26 27.085789 3635

2 rows × 22 columns

In [77]:
fig = px.scatter(df, x="Provider_State", y="State_Total", width=1000, height=500,
             color = "Provider_State",
            title = "Total Procedures/Treatments  per state")
fig.show()
In [78]:
fig = plt.figure(figsize=(15,7))
sns.distplot(df['State_Total'])
plt.tight_layout()
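As noted under Feature 8, the raw counts are misleading without a population comparison. A minimal sketch of that per-capita view, assuming a hypothetical `state_population` series (real values would come from census data):

```python
import pandas as pd

# Toy state-level treatment counts (mirrors `patients_states` above).
patients_states = pd.DataFrame({'Provider_State': ['CA', 'WY'],
                                'Count': [13064, 350]})

# Hypothetical state populations.
state_population = pd.Series({'CA': 39_500_000, 'WY': 580_000}, name='Population')

# Treatments per 100,000 residents puts big and small states on the same scale.
normalized = patients_states.set_index('Provider_State').join(state_population)
normalized['Treatments_per_100k'] = (normalized['Count']
                                     / normalized['Population'] * 100_000)
print(normalized)
```

With these toy numbers, Wyoming's per-capita rate exceeds California's even though its raw count is a tiny fraction of California's, which is exactly the distortion the normalization removes.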

Feature 9

Average treatments/patients by State (mean values, as grouped by State)

States have different norms and rules. This feature allows us to capture the normal state of each. The results can also be used to compare against the overall mean.

In [79]:
patient_avg_state = df.groupby('Provider_State').mean()[['Total_Discharges', 
                                                         'Average_Total_Payments',
                                                         'Average_Medicare_Payments']].reset_index()
patient_avg_state.head()
Out[79]:
Provider_State Total_Discharges Average_Total_Payments Average_Medicare_Payments
0 AK 26.588745 14572.391732 12958.969437
1 AL 39.258322 7568.232149 6418.007120
2 AR 41.978229 8019.248805 6919.720832
3 AZ 36.690284 10154.528211 8825.717240
4 CA 36.357854 12629.668472 11494.381678
In [80]:
patient_avg_state.loc[:,'Total_Discharges':'Average_Medicare_Payments'].corr()
Out[80]:
Total_Discharges Average_Total_Payments Average_Medicare_Payments
Total_Discharges 1.000000 -0.124043 -0.060745
Average_Total_Payments -0.124043 1.000000 0.991735
Average_Medicare_Payments -0.060745 0.991735 1.000000
In [81]:
fig = plt.figure(figsize=(15,10))

plt.subplot(2, 2, 1)
plt.boxplot(patient_avg_state['Total_Discharges'])
plt.title('Total Discharges Boxplot')
plt.xlabel('')

plt.subplot(2, 2, 3)
plt.boxplot(patient_avg_state['Average_Total_Payments'])
plt.title('Average Total Payment Boxplot')
plt.xlabel('')

plt.subplot(2, 2, 4)
plt.boxplot(patient_avg_state['Average_Medicare_Payments'])
plt.title('Average Medicare Payment Boxplot')
plt.tight_layout()
plt.show()
In [82]:
# States with highest discharges

patient_avg_state.sort_values(by = 'Total_Discharges', ascending = False).head()
Out[82]:
Provider_State Total_Discharges Average_Total_Payments Average_Medicare_Payments
8 DE 67.901015 10360.072411 8959.673274
22 MI 54.539952 9754.420406 8662.157756
31 NJ 52.052839 10678.988647 9586.940056
20 MD 51.955255 12608.947664 11480.121829
27 NC 51.043841 9089.435711 7998.649702
In [83]:
# States with highest Average Total Payments

patient_avg_state.sort_values(by = 'Average_Total_Payments', ascending = False).head()
Out[83]:
Provider_State Total_Discharges Average_Total_Payments Average_Medicare_Payments
0 AK 26.588745 14572.391732 12958.969437
7 DC 43.954545 12998.029416 11811.967706
11 HI 26.497738 12775.739525 10967.475045
4 CA 36.357854 12629.668472 11494.381678
20 MD 51.955255 12608.947664 11480.121829

Note:

These new mean columns are useful, but I believe Median columns will be more handy. So, I will not add the mean columns to the original dataset yet.

Feature 10

Calculate the median of average total payment amount by DRG, by state.

'Median Average Total Payment'

To catch payments for treatments that exceed a normal amount, I will first create a feature that calculates a score for each data row. The score indicates the size of the payment amount for a particular treatment relative to the median payment amount for that DRG code at the state level.

In [84]:
median_drg_state = df.groupby(['DRG_Definition','Provider_State'])['Average_Total_Payments'].median().reset_index()
median_drg_state.head()
Out[84]:
DRG_Definition Provider_State Average_Total_Payments
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC AK 8401.95
1 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC AL 5658.33
2 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC AR 5890.00
3 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC AZ 6959.89
4 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC CA 7863.14
In [85]:
df = pd.merge(left = df, right = median_drg_state, left_on = ['DRG_Definition','Provider_State'], right_on = ['DRG_Definition','Provider_State'], how = 'left')
df.rename(columns={'Average_Total_Payments_y':'Median_Avg_Total_Pyt',
                   'Average_Total_Payments_x':'Average_Total_Payments'}, inplace=True)
df.head(2)
Out[85]:
DRG_Definition Provider_Id Provider_Name Provider_Street_Address Provider_City Provider_State Provider_Zip_Code Hospital_Referral_Region_Description Total_Discharges Average_Covered_Charges ... Division Zip_Median_Income Zip_Population Average_Cost_Per_Procedure Medicare_%_Paid Medicare_%_Paid_State Out_of_Pocket_Payment Out_of_Pocket_per_discharge State_Total Median_Avg_Total_Pyt
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 $32963.07 ... East South Central 38007.8711 35759.0 63.486154 0.824568 0.816622 1013.51 11.137473 3635 5658.33
1 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 38 $20312.78 ... East South Central 38007.8711 35759.0 128.809474 0.789722 0.816622 1029.26 27.085789 3635 4913.89

2 rows × 23 columns

In [86]:
# Check for one particular state and one particular DRG, to see median

df[(df['Provider_State'] == 'NV') & (df['DRG_Definition'] == '194 - SIMPLE PNEUMONIA & PLEURISY W CC')].head(2)
Out[86]:
DRG_Definition Provider_Id Provider_Name Provider_Street_Address Provider_City Provider_State Provider_Zip_Code Hospital_Referral_Region_Description Total_Discharges Average_Covered_Charges ... Division Zip_Median_Income Zip_Population Average_Cost_Per_Procedure Medicare_%_Paid Medicare_%_Paid_State Out_of_Pocket_Payment Out_of_Pocket_per_discharge State_Total Median_Avg_Total_Pyt
89303 194 - SIMPLE PNEUMONIA & PLEURISY W CC 290001 RENOWN REGIONAL MEDICAL CENTER 1155 MILL STREET RENO NV 89502 NV - Reno 100 $26254.72 ... Mountain 38624.424 45159.0 66.807600 0.774291 0.827003 1507.91 15.079100 1202 6990.625
89400 194 - SIMPLE PNEUMONIA & PLEURISY W CC 290003 SUNRISE HOSPITAL AND MEDICAL CENTER 3186 S MARYLAND PKWY LAS VEGAS NV 89109 NV - Las Vegas 80 $62134.62 ... Mountain 37456.017 9490.0 103.280125 0.765924 0.827003 1934.03 24.175375 1202 6990.625

2 rows × 23 columns

Feature 11

Creating a common median multiple score for Average Total Payments

Explain a potential fraud case

'Median Score'

I will now take the average total payment and divide it by the median payment to generate a simple score indicating how many times larger the current payment amount is than the median amount.

In [87]:
df['Median_Score'] = df['Average_Total_Payments']/df['Median_Avg_Total_Pyt']
df.head(2)
Out[87]:
DRG_Definition Provider_Id Provider_Name Provider_Street_Address Provider_City Provider_State Provider_Zip_Code Hospital_Referral_Region_Description Total_Discharges Average_Covered_Charges ... Zip_Median_Income Zip_Population Average_Cost_Per_Procedure Medicare_%_Paid Medicare_%_Paid_State Out_of_Pocket_Payment Out_of_Pocket_per_discharge State_Total Median_Avg_Total_Pyt Median_Score
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 $32963.07 ... 38007.8711 35759.0 63.486154 0.824568 0.816622 1013.51 11.137473 3635 5658.33 1.021015
1 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 38 $20312.78 ... 38007.8711 35759.0 128.809474 0.789722 0.816622 1029.26 27.085789 3635 4913.89 0.996107

2 rows × 24 columns

In [88]:
fig = plt.figure(figsize=(10,5))
sns.distplot(df['Median_Score'])
plt.tight_layout()
In [89]:
fig = plt.figure(figsize=(15,5))
sns.boxplot(df['Median_Score'])
plt.tight_layout()
In [90]:
df['Median_Score'].describe()
Out[90]:
count    163065.000000
mean          1.050746
std           0.211465
min           0.517695
25%           0.925511
50%           1.000000
75%           1.112126
max           9.338775
Name: Median_Score, dtype: float64

Observation:

It would appear that most treatment payment amounts for the same DRG within the same state fall within 90% to 110% of the median price. This is expected, as doctors should normally charge similar prices for similar treatments in similar areas. However, we see instances where the payment amount is many times the median.

As we see in the box plot above, in two specific cases the treatment cost over 9 times the median amount.

Let us investigate further.

Below are the cases where the payment made was over 6 times the median amount.

In [91]:
df[df['Median_Score'] >= 6]
Out[91]:
DRG_Definition Provider_Id Provider_Name Provider_Street_Address Provider_City Provider_State Provider_Zip_Code Hospital_Referral_Region_Description Total_Discharges Average_Covered_Charges ... Zip_Median_Income Zip_Population Average_Cost_Per_Procedure Medicare_%_Paid Medicare_%_Paid_State Out_of_Pocket_Payment Out_of_Pocket_per_discharge State_Total Median_Avg_Total_Pyt Median_Score
69869 203 - BRONCHITIS & ASTHMA W/O CC/MCC 220008 STURDY MEMORIAL HOSPITAL 211 PARK STREET ATTLEBORO MA 02703 RI - Providence 11 $7965.18 ... 67397.7900 41916.0 3771.099091 0.043155 0.872525 39691.91 3608.355455 3842 4441.92 9.338775
125467 189 - PULMONARY EDEMA & RESPIRATORY FAILURE 390096 ST JOSEPH MEDICAL CENTER 2500 BERNVILLE ROAD READING PA 19605 PA - Reading 143 $24542.94 ... 56478.8113 16695.0 509.076434 0.106021 0.843987 65079.84 455.103776 7804 7956.00 9.150067
130307 948 - SIGNS & SYMPTOMS W/O MCC 390312 CANCER TREATMENT CENTERS OF AMERICA 1331 EAST WYOMING AVENUE PHILADELPHIA PA 19124 PA - Philadelphia 24 $83945.95 ... 29285.1249 62905.0 1207.008333 0.307033 0.843987 20074.00 836.416667 7804 4476.05 6.471822
156606 249 - PERC CARDIOVASC PROC W NON-DRUG-ELUTING ... 500051 OVERLAKE HOSPITAL MEDICAL CENTER 1035-116TH AVE NE BELLEVUE WA 98004 WA - Seattle 23 $44499.00 ... 99669.2313 27946.0 3673.880870 0.100600 0.841795 75998.66 3304.289565 2778 12850.13 6.575751

4 rows × 24 columns

Observation:

The highest median score is 9.33 which is for:

  • index number - 69869
  • DRG - 203 - BRONCHITIS & ASTHMA W/O CC/MCC
  • City - ATTLEBORO
  • State - MA

This individual was charged USD 41,482 for a treatment with a median amount of just USD 4,441.92. This particular treatment was performed at the provider STURDY MEMORIAL HOSPITAL.

Now, let's examine this particular hospital more closely.

In [92]:
suspect_hospital1 = df[df['Provider_Name'] == 'STURDY MEMORIAL HOSPITAL']['Median_Score']

fig = plt.figure(figsize=(14,5))
sns.distplot(suspect_hospital1)
plt.tight_layout()

print(" ")
print("Median Score distribution for Provider - STURDY MEMORIAL HOSPITAL is as follows")
print(" ")
print(suspect_hospital1.describe())
 
Median Score distribution for Provider - STURDY MEMORIAL HOSPITAL is as follows
 
count    69.000000
mean      1.027317
std       1.016937
min       0.786568
25%       0.873115
50%       0.903796
75%       0.928759
max       9.338775
Name: Median_Score, dtype: float64

Observation:

The distribution is striking. The hospital typically charges less than the median amount for its treatments, as the highest number of observations fall below a median score of 1. This makes the treatment with a median multiple score of over 9 highly unusual.

Now, another hospital with a treatment which cost way over the median amount is 'ST JOSEPH MEDICAL CENTER', which is:

  • index number - 125467
  • DRG - 189 - PULMONARY EDEMA & RESPIRATORY FAILURE
  • City - READING
  • State - PA

Let's examine this hospital's track record too.

In [93]:
suspect_hospital2 = df[df['Provider_Name'] == 'ST JOSEPH MEDICAL CENTER']['Median_Score']

fig = plt.figure(figsize=(14,5))
sns.distplot(suspect_hospital2)
plt.tight_layout()

print(" ")
print("Median Score distribution for Provider - ST JOSEPH MEDICAL CENTER is as follows")
print(" ")
print(suspect_hospital2.describe())
 
Median Score distribution for Provider - ST JOSEPH MEDICAL CENTER is as follows
 
count    427.000000
mean       1.060935
std        0.424824
min        0.720824
25%        0.933857
50%        1.007073
75%        1.099725
max        9.150067
Name: Median_Score, dtype: float64

Observation:

Again, this is a hospital that typically charges reasonable prices for treatment; most observations fall below a median score of 2, which is fine. But this makes the one treatment at over 9 times the median price an extreme outlier deserving of extra attention.

Making the broad assumption that around 1% of medical payments are fraudulent, I will tag any medical treatment whose payment exceeds the 99th percentile of median scores in the dataset.

In [94]:
np.percentile(df['Median_Score'], 99)
Out[94]:
1.790032745554255

1% of medical treatments cost more than 1.79 times the median payment for that treatment in their state. Treatments that paid more than this shall be flagged.

Feature 12

Boolean flag for rows whose Median Score falls at or above the 99th percentile of the data.

'Median Score Flag'

This flag marks the treatments whose payment amount is extreme relative to the state-level median for the same DRG, using the 99th-percentile cutoff computed above.

In [95]:
df['Median_Score_Flag'] = df['Median_Score'] >= np.percentile(df['Median_Score'], 99)
df.head(2)
Out[95]:
DRG_Definition Provider_Id Provider_Name Provider_Street_Address Provider_City Provider_State Provider_Zip_Code Hospital_Referral_Region_Description Total_Discharges Average_Covered_Charges ... Zip_Population Average_Cost_Per_Procedure Medicare_%_Paid Medicare_%_Paid_State Out_of_Pocket_Payment Out_of_Pocket_per_discharge State_Total Median_Avg_Total_Pyt Median_Score Median_Score_Flag
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 $32963.07 ... 35759.0 63.486154 0.824568 0.816622 1013.51 11.137473 3635 5658.33 1.021015 False
1 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 38 $20312.78 ... 35759.0 128.809474 0.789722 0.816622 1029.26 27.085789 3635 4913.89 0.996107 False

2 rows × 25 columns

This approach is good for finding providers that overcharge substantially. However, it is not so useful for finding hospitals that overcharge slightly but over the course of many treatments. One way to find these providers is to identify which ones have the highest average median score.

Feature 13

Calculate the Median Score, this time at the individual provider level (previously we computed it at the individual row level).

'Median Score by Provider'

Averaging the Median Score for each provider surfaces hospitals that consistently charge above the state median, even when no single treatment stands out on its own.

In [96]:
Median_Score_by_Provider = df.groupby(['Provider_Name']).mean()['Median_Score'].reset_index()
print(Median_Score_by_Provider.head(2))

fig = plt.figure(figsize=(10,5))
sns.distplot(Median_Score_by_Provider['Median_Score'])
plt.tight_layout()

print(" ")
print("Median Score distribution by Provider is as follows")
print(" ")
print(Median_Score_by_Provider.describe())
                  Provider_Name  Median_Score
0    ABBEVILLE GENERAL HOSPITAL      1.033378
1  ABBOTT NORTHWESTERN HOSPITAL      1.056686
 
Median Score distribution by Provider is as follows
 
       Median_Score
count   3201.000000
mean       1.043923
std        0.200934
min        0.635737
25%        0.927984
50%        0.995734
75%        1.096960
max        3.777786

Some providers consistently charge multiple times the median payment amount. However, before we blow the whistle on these hospitals, we have to consider that perhaps they charge more than their in-state counterparts because they are high-end/luxury hospitals and/or located in expensive cities. Let's see which hospitals these are.

In [97]:
df = pd.merge(left = df, right = Median_Score_by_Provider, left_on = 'Provider_Name', 
              right_on = 'Provider_Name', how = 'left')

df.rename(columns={'Median_Score_x':'Median_Score',
                   'Median_Score_y':'Median_Score_by_Provider'}, inplace=True)
df.head(2)
Out[97]:
DRG_Definition Provider_Id Provider_Name Provider_Street_Address Provider_City Provider_State Provider_Zip_Code Hospital_Referral_Region_Description Total_Discharges Average_Covered_Charges ... Average_Cost_Per_Procedure Medicare_%_Paid Medicare_%_Paid_State Out_of_Pocket_Payment Out_of_Pocket_per_discharge State_Total Median_Avg_Total_Pyt Median_Score Median_Score_Flag Median_Score_by_Provider
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 $32963.07 ... 63.486154 0.824568 0.816622 1013.51 11.137473 3635 5658.33 1.021015 False 1.019344
1 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 38 $20312.78 ... 128.809474 0.789722 0.816622 1029.26 27.085789 3635 4913.89 0.996107 False 1.019344

2 rows × 26 columns

Display the hospitals that, on average, charge more than double the median price for the treatments they provide, compared to others within their state.

I had expected these hospitals to either be luxury hospitals or located in expensive cities.

In [98]:
df[df['Median_Score_by_Provider'] >= 2][['Provider_Name', 'Provider_State','Provider_City']].drop_duplicates()
Out[98]:
Provider_Name Provider_State Provider_City
15017 CONTRA COSTA REGIONAL MEDICAL CENTER CA MARTINEZ
19110 MEMORIAL HOSPITAL LOS BANOS CA LOS BANOS
23176 KEEFE MEMORIAL HOSPITAL CO CHEYENNE WELLS
23353 VAIL VALLEY MEDICAL CENTER CO VAIL
27595 JACKSON MEMORIAL HOSPITAL FL MIAMI
46304 MIDWESTERN REGION MED CENTER IL ZION
96910 GUADALUPE COUNTY HOSPITAL NM SANTA ROSA
130305 CANCER TREATMENT CENTERS OF AMERICA PA PHILADELPHIA
138799 UNIVERSITY OF TEXAS MEDICAL BRANCH GAL TX GALVESTON
159340 WELCH COMMUNITY HOSPITAL WV WELCH

Feature 14

Create a boolean benchmark for hospitals who overcharge

'Provider Flag by Median Score'

I set the benchmark at a median score of 2 or higher. This avoids situations where the cost of living in a city inflates the cost of treatment: even in the most expensive cities, the average cost of treatment should not be double the average cost of treatment elsewhere in the same state.

In [99]:
df['Provider_Flag_by_Median_Score'] = df['Median_Score_by_Provider'] >= 2
df.head(2)
Out[99]:
DRG_Definition Provider_Id Provider_Name Provider_Street_Address Provider_City Provider_State Provider_Zip_Code Hospital_Referral_Region_Description Total_Discharges Average_Covered_Charges ... Medicare_%_Paid Medicare_%_Paid_State Out_of_Pocket_Payment Out_of_Pocket_per_discharge State_Total Median_Avg_Total_Pyt Median_Score Median_Score_Flag Median_Score_by_Provider Provider_Flag_by_Median_Score
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 $32963.07 ... 0.824568 0.816622 1013.51 11.137473 3635 5658.33 1.021015 False 1.019344 False
1 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 38 $20312.78 ... 0.789722 0.816622 1029.26 27.085789 3635 4913.89 0.996107 False 1.019344 False

2 rows × 27 columns

See the list of providers flagged because their treatment payments are, on average, more than double the median amount for the state they are in.

In [100]:
df[df['Provider_Flag_by_Median_Score']][['Provider_Id','Provider_Name','Provider_City',
                                         'Provider_State','Median_Score_by_Provider']].drop_duplicates()
Out[100]:
Provider_Id Provider_Name Provider_City Provider_State Median_Score_by_Provider
15017 50276 CONTRA COSTA REGIONAL MEDICAL CENTER MARTINEZ CA 2.254486
19110 50528 MEMORIAL HOSPITAL LOS BANOS LOS BANOS CA 2.508609
23176 60043 KEEFE MEMORIAL HOSPITAL CHEYENNE WELLS CO 2.376034
23353 60096 VAIL VALLEY MEDICAL CENTER VAIL CO 2.030422
27595 100022 JACKSON MEMORIAL HOSPITAL MIAMI FL 2.173011
46304 140100 MIDWESTERN REGION MED CENTER ZION IL 2.861803
96910 320067 GUADALUPE COUNTY HOSPITAL SANTA ROSA NM 2.187486
130305 390312 CANCER TREATMENT CENTERS OF AMERICA PHILADELPHIA PA 3.777786
138799 450018 UNIVERSITY OF TEXAS MEDICAL BRANCH GAL GALVESTON TX 2.336738
159340 510086 WELCH COMMUNITY HOSPITAL WELCH WV 2.085069

The hospital with the highest score is CANCER TREATMENT CENTERS OF AMERICA, at roughly 3.78 times the state median.

I assume these flagged hospitals consistently overcharge on procedures that are much cheaper at other hospitals within the same state. Let's verify this for one of them, MEMORIAL HOSPITAL LOS BANOS, below.

In [101]:
df[df['Provider_Name'] == 'MEMORIAL HOSPITAL LOS BANOS'][['Median_Score']]
Out[101]:
Median_Score
19110 2.624709
19111 2.473714
19112 2.490096
19113 2.748748
19114 2.305088
19115 2.489811
19116 2.471841
19117 2.464865

Observation:

As we see, every Median_Score for this hospital is well above 2, so this is an effective method to track and find hospitals that overcharge or commit fraud.
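The flagging logic above can be sketched on a toy frame (a hedged mini-example, not the real dataset; the median here is taken state-wide for brevity):

```python
import pandas as pd

# Toy data: three normal rows for provider A, one inflated row for provider B.
df = pd.DataFrame({
    'Provider_Name': ['A', 'A', 'A', 'B'],
    'Provider_State': ['CA', 'CA', 'CA', 'CA'],
    'Average_Total_Payments': [5000.0, 5500.0, 6000.0, 20000.0],
})

# Score each row against its state-wide median payment.
state_median = df.groupby('Provider_State')['Average_Total_Payments'].transform('median')
df['Median_Score'] = df['Average_Total_Payments'] / state_median

# Flag providers whose mean score is more than double the state median.
provider_mean = df.groupby('Provider_Name')['Median_Score'].transform('mean')
df['Provider_Flag_by_Median_Score'] = provider_mean > 2
```

Provider B's row scores 20000 / 5750 ≈ 3.48 and is flagged; provider A's rows hover around 1 and are not.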

Feature 15

Sum total of 'Total Discharged' grouped by the Provider Id and Provider Name

I will check the grand total of discharges made by each provider. The hypothesis is that hospitals with the highest number of discharges are more susceptible to fraud, such as false patients and fake claims.

In [102]:
discharges_provider = df.groupby(['Provider_Id','Provider_Name'])['Total_Discharges'].sum().reset_index(name='Grand Total of Discharges')
discharges_provider = discharges_provider.sort_values(by='Grand Total of Discharges', ascending=False)
discharges_provider.head()
Out[102]:
Provider_Id Provider_Name Grand Total of Discharges
599 100007 FLORIDA HOSPITAL 25828
1970 330101 NEW YORK-PRESBYTERIAN HOSPITAL 16834
2882 450388 METHODIST HOSPITAL 15921
583 80001 CHRISTIANA CARE HEALTH SERVICES, INC. 14542
1534 230130 WILLIAM BEAUMONT HOSPITAL 14469

Observation:

The highest total, 25,828 discharges, belongs to 'FLORIDA HOSPITAL'.

However, it would be ideal to also group the total discharges by Provider State for better inference.

In [103]:
discharges_provider_state = df.groupby(['Provider_State','Provider_Id', 'Provider_Name'])['Total_Discharges'].sum().reset_index(name='Grand Total of Discharges')
discharges_provider_state = discharges_provider_state.sort_values(by='Grand Total of Discharges', ascending=False)
discharges_provider_state.head()
Out[103]:
Provider_State Provider_Id Provider_Name Grand Total of Discharges
599 FL 100007 FLORIDA HOSPITAL 25828
2065 NY 330101 NEW YORK-PRESBYTERIAN HOSPITAL 16834
2882 TX 450388 METHODIST HOSPITAL 15921
590 DE 80001 CHRISTIANA CARE HEALTH SERVICES, INC. 14542
1534 MI 230130 WILLIAM BEAUMONT HOSPITAL 14469

Feature 16

Ratio of 'Average Total Payments' to 'Zip_Median_Income'

'Avg_Payment_by_Median_Income'

This ratio gives an idea of whether the average payments are high or low relative to the median income of the zip code's population. The hypothesis is that if the ratio is high, people in that zip code are paying far more than their median income for treatment, which might indicate fraud through inflated bills.

The same hospital may have a low ratio for one DRG or state but a high ratio for another, so I will not group the data; I need an individual ratio for each row.

In [104]:
df['Avg_Payment_by_Median_Income'] = df['Average_Total_Payments'] / df['Zip_Median_Income']
df.head(2)
Out[104]:
DRG_Definition Provider_Id Provider_Name Provider_Street_Address Provider_City Provider_State Provider_Zip_Code Hospital_Referral_Region_Description Total_Discharges Average_Covered_Charges ... Medicare_%_Paid_State Out_of_Pocket_Payment Out_of_Pocket_per_discharge State_Total Median_Avg_Total_Pyt Median_Score Median_Score_Flag Median_Score_by_Provider Provider_Flag_by_Median_Score Avg_Payment_by_Median_Income
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 $32963.07 ... 0.816622 1013.51 11.137473 3635 5658.33 1.021015 False 1.019344 False 0.152001
1 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 38 $20312.78 ... 0.816622 1029.26 27.085789 3635 4913.89 0.996107 False 1.019344 False 0.128783

2 rows × 28 columns

In [105]:
fig, ax = plt.subplots(figsize= (20,5))
sns.boxplot(df["Avg_Payment_by_Median_Income"])
plt.title("Distribution of Average Total Payments by Zipcode Median Income", fontsize=20)
Out[105]:
Text(0.5, 1.0, 'Distribution of Average Total Payments by Zipcode Median Income')

Ratios over 5 might need some investigation.
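Pulling the rows beyond that eyeballed threshold is a one-line filter; a hedged sketch on illustrative data:

```python
import pandas as pd

# Illustrative data: two providers sit above the ratio-of-5 threshold.
demo = pd.DataFrame({
    'Provider_Id': [1, 2, 3, 4],
    'Avg_Payment_by_Median_Income': [0.15, 0.60, 5.40, 12.10],
})

# Keep only the rows whose payment-to-income ratio exceeds 5 for review.
suspects = demo[demo['Avg_Payment_by_Median_Income'] > 5]
list(suspects['Provider_Id'])  # providers 3 and 4 exceed the threshold
```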

Feature 17

Ratio of 'Total Discharges' to 'Zip_Population'

'Total_Disc_by_Pop'

Hospitals with a very high ratio may be reporting false patients.

Those with a higher ratio might need further investigation. The same hospital may have a low ratio for one DRG or state but a high ratio for another, so I will not group the data; I need an individual ratio for each row.

In [106]:
df['Total_Disc_by_Pop'] = df['Total_Discharges'] / df['Zip_Population']
df.head(2)
Out[106]:
DRG_Definition Provider_Id Provider_Name Provider_Street_Address Provider_City Provider_State Provider_Zip_Code Hospital_Referral_Region_Description Total_Discharges Average_Covered_Charges ... Out_of_Pocket_Payment Out_of_Pocket_per_discharge State_Total Median_Avg_Total_Pyt Median_Score Median_Score_Flag Median_Score_by_Provider Provider_Flag_by_Median_Score Avg_Payment_by_Median_Income Total_Disc_by_Pop
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 91 $32963.07 ... 1013.51 11.137473 3635 5658.33 1.021015 False 1.019344 False 0.152001 0.002545
1 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... 10001 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE DOTHAN AL 36301 AL - Dothan 38 $20312.78 ... 1029.26 27.085789 3635 4913.89 0.996107 False 1.019344 False 0.128783 0.001063

2 rows × 29 columns

In [107]:
fig, ax = plt.subplots(figsize= (20,5))
sns.boxplot(df["Total_Disc_by_Pop"])
plt.title("Distribution of Total Discharges by Zipcode Population", fontsize=20)
Out[107]:
Text(0.5, 1.0, 'Distribution of Total Discharges by Zipcode Population')

Ratios over 10 might need investigation.

.

-------------------------------------- SECTION BREAK --------------------------------------

.

D. k-means Clustering

I will create a new copy of the data to perform the analysis, and call it 'df1'.

In [108]:
# Create a copy of the original dataset

df1 = df.copy()
In [109]:
# Check nulls

df1.isnull().sum().sum()
Out[109]:
48452

These NAs were created as a result of merging the income and population data by zip code; certain zip codes in our dataset have no entry in the zip-code dataset.

So, I will replace the NAs with the median of the respective column.

1. Impute missing Income and Population values with respective median

In [110]:
df1 = df1.fillna(df1.median())
In [111]:
df1.isnull().sum().sum()
Out[111]:
0

2. Drop irrelevant columns

There are many columns that will not be useful in the k-means clustering analysis, most of which are categorical. I will drop them.

In [112]:
df1 = df1.drop(columns = ['Provider_Id', 'Provider_Name', 'Provider_Street_Address', 
                        'Provider_City', 'Provider_Zip_Code', 
                        'Hospital_Referral_Region_Description', 'Average_Covered_Charges', 
                        'Region',
                        'Division', 'Zip_Median_Income',
                        'Zip_Population', 'Median_Score_Flag',
                        'Provider_Flag_by_Median_Score'])
df1.head(2)
Out[112]:
DRG_Definition Provider_State Total_Discharges Average_Total_Payments Average_Medicare_Payments Average_Cost_Per_Procedure Medicare_%_Paid Medicare_%_Paid_State Out_of_Pocket_Payment Out_of_Pocket_per_discharge State_Total Median_Avg_Total_Pyt Median_Score Median_Score_by_Provider Avg_Payment_by_Median_Income Total_Disc_by_Pop
0 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC AL 91 5777.24 4763.73 63.486154 0.824568 0.816622 1013.51 11.137473 3635 5658.33 1.021015 1.019344 0.152001 0.002545
1 057 - DEGENERATIVE NERVOUS SYSTEM DISORDERS W/... AL 38 4894.76 3865.50 128.809474 0.789722 0.816622 1029.26 27.085789 3635 4913.89 0.996107 1.019344 0.128783 0.001063

3. Dummy coding all important categorical variables

If dummy coding were used, the first level of each variable would be dropped to avoid multi-collinearity.

In [113]:
#cat_variables = ['DRG_Definition', 'Provider_State']

#df = pd.get_dummies(
#    df,
#    columns = cat_variables,
#    drop_first=True
#)
#df.head(2)

I will drop the categorical columns, since I have already used them to create meaningful new features.

Label encoding would be misleading for clustering (it imposes an artificial ordering), and one-hot encoding creates many new binary features, which is not ideal for k-means.
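To illustrate the dimensionality concern (a hedged sketch on a hypothetical column, not the real data): each level of a categorical becomes its own binary column, so a high-cardinality column like DRG_Definition would add roughly a hundred dimensions.

```python
import pandas as pd

# A tiny categorical with 4 distinct levels.
demo = pd.DataFrame({'DRG': ['039', '057', '064', '039', '065']})

# One-hot encode; drop_first removes one redundant level per variable.
encoded = pd.get_dummies(demo, columns=['DRG'], drop_first=True)
encoded.shape  # 4 distinct levels -> 3 dummy columns
```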

In [114]:
df1 = df1.drop(columns = ['DRG_Definition', 'Provider_State'])

4. Final check for multi-collinearity

I will check for multi-collinearity among all the features and drop one feature from each highly correlated pair.

In [115]:
# All numeric / float variables in the dataset

num_variables_all = ['Total_Discharges', 'Average_Total_Payments', 'Average_Medicare_Payments',
                'Average_Cost_Per_Procedure', 'Medicare_%_Paid',
       'Medicare_%_Paid_State', 'Out_of_Pocket_Payment',
       'Out_of_Pocket_per_discharge', 'State_Total', 'Median_Avg_Total_Pyt',
       'Median_Score', 'Median_Score_by_Provider',
       'Avg_Payment_by_Median_Income',
       'Total_Disc_by_Pop']
In [116]:
corr = df1[num_variables_all].corr()
f,ax = plt.subplots(figsize=(15,10))
sns.heatmap(corr, annot=True, cmap='Reds', linewidths=.4, fmt= '.1f', ax=ax)
plt.show()
In [117]:
# Remove one variable from each highly collinear pair

df1 = df1.drop(columns = ['Average_Medicare_Payments', 'Median_Avg_Total_Pyt', 'Out_of_Pocket_per_discharge',
                        'Average_Cost_Per_Procedure', 'Median_Score_by_Provider',
                       'Avg_Payment_by_Median_Income'])
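The manual drop above could also be automated; a hedged sketch (hypothetical helper, 0.8 threshold assumed) that lists one column from each highly correlated pair:

```python
import numpy as np
import pandas as pd

def high_corr_columns(frame, threshold=0.8):
    """Return one column from each pair with |correlation| above threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# 'b' is an exact multiple of 'a', 'c' is only weakly related to both.
demo = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [4, 1, 3, 2]})
high_corr_columns(demo)  # 'b' duplicates 'a'
```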
In [118]:
# Final numeric variables selected

num_variables = ['Total_Discharges', 'Average_Total_Payments',
                'Medicare_%_Paid',
       'Medicare_%_Paid_State', 'Out_of_Pocket_Payment',
       'State_Total', 'Median_Score',
       'Total_Disc_by_Pop']
In [119]:
df1.columns
Out[119]:
Index(['Total_Discharges', 'Average_Total_Payments', 'Medicare_%_Paid',
       'Medicare_%_Paid_State', 'Out_of_Pocket_Payment', 'State_Total',
       'Median_Score', 'Total_Disc_by_Pop'],
      dtype='object')

5. Scaling all float or integer variables

I will use StandardScaler.

In [120]:
X = StandardScaler().fit_transform(df1[num_variables])
X = pd.DataFrame(data = X, columns = num_variables)
X.head()
Out[120]:
Total_Discharges Average_Total_Payments Medicare_%_Paid Medicare_%_Paid_State Out_of_Pocket_Payment State_Total Median_Score Total_Disc_by_Pop
0 0.943640 -0.512776 -0.249983 -1.562935 -0.173743 -0.656043 -0.140595 -0.028289
1 -0.093463 -0.627913 -0.646365 -1.562935 -0.160024 -0.656043 -0.258383 -0.031520
2 0.806665 0.072115 0.533658 -1.562935 -0.104393 -0.656043 -0.321171 -0.028715
3 2.469943 -0.412988 -0.173291 -1.562935 -0.095291 -0.656043 -0.093474 -0.023532
4 -0.191303 -0.666841 -0.673441 -1.562935 -0.205142 -0.656043 -0.166491 -0.031825
In [121]:
# df1[num_variables] = X
# df1.head(2)
In [122]:
X.isnull().sum().sum()
Out[122]:
0

6. k-Means clustering

I will run k-means for k in the range 1-19 to get a comprehensive view.

In [123]:
# Within-cluster sum of squares (WCSS) for each candidate k
wcss = []

# Fit k-means for each k and record the inertia (WCSS)
for i in range(1, 20):
    kmeans = KMeans(n_clusters=i, max_iter=300)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
In [124]:
plt.plot(range(1,20), wcss, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()

As per the above graph, there is a small elbow at k = 2.

But for my analysis, I will use the other small elbow I observe at k = 8.

Two clusters don't seem sufficient to identify anomalies in a dataset of over 160k records.
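As a hedged cross-check of the elbow choice (demonstrated on synthetic blobs, not the real data), the silhouette score offers a second opinion on k: it tends to peak where clusters are best separated.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data with a known cluster count.
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)

# Silhouette score for each candidate k; higher is better separated.
scores = {k: silhouette_score(
              X_demo,
              KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo))
          for k in range(2, 9)}
best_k = max(scores, key=scores.get)
```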

7. Cluster analysis with 8 clusters

In [125]:
kmeans1 = KMeans(n_clusters = 8, init ='k-means++', max_iter=300, n_init=10,random_state=0)
kmeans1.fit(X)
Out[125]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=8, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=0, tol=0.0001, verbose=0)
In [126]:
# Add the cluster labels back to both data sets

df['cluster'] = kmeans1.labels_
df1['cluster'] = kmeans1.labels_

8. Check the size of each cluster

In [127]:
# cluster size, number of observations in each cluster

temp = df['cluster'].value_counts()
df['cluster'].value_counts()
Out[127]:
0    48009
6    32512
2    30036
4    28937
7    14791
1     5879
3     2869
5       32
Name: cluster, dtype: int64

9. Percent of total data in each cluster

In [128]:
# Calculate the percentage of the points/rows in each cluster:

temp = (df['cluster'].value_counts(normalize=True) * 100).reset_index()
temp.columns = ['cluster', 'Percentage of total']
temp
Out[128]:
cluster Percentage of total
0 0 29.441634
1 6 19.938062
2 2 18.419649
3 4 17.745684
4 7 9.070616
5 1 3.605311
6 3 1.759421
7 5 0.019624

Observation:

From the above 8 cluster analysis, the following can be explained:

A. Clusters 5, 3, and 1 hold a very small percentage of the total data. There are two possible explanations for these clusters:

  • Either they are true anomalies, or
  • They are providers of rare or very expensive operations, surgeries, or services, performed by only a very limited number of providers in the country.

B. Cluster 7 holds roughly 10% of the data. This is the critical cut-off point: I believe anything holding around 10% or more of the entire data represents a different but normal cluster group.

C. The other clusters, which hold over 10% of the total data, seem normal. Their differences probably stem from state or DRG definition, but they are not likely to be anomalies.

10. Group by the Clusters and see the mean of the all the clusters

This will help identify anomaly-containing clusters and prompt further inspection.

In [129]:
display(df.groupby('cluster').mean().reset_index())    # cluster means, grouped by cluster
cluster Total_Discharges Average_Total_Payments Average_Medicare_Payments Zip_Median_Income Zip_Population Average_Cost_Per_Procedure Medicare_%_Paid Medicare_%_Paid_State Out_of_Pocket_Payment Out_of_Pocket_per_discharge State_Total Median_Avg_Total_Pyt Median_Score Median_Score_Flag Median_Score_by_Provider Provider_Flag_by_Median_Score Avg_Payment_by_Median_Income Total_Disc_by_Pop
0 0 34.178779 8558.132463 7620.932059 46900.885337 25897.371421 376.177545 0.877176 0.836406 937.200405 39.596838 2934.141140 8590.858992 1.009427 0.000000 1.025693 0.000000 0.207469 0.011794
1 1 230.034700 9570.325031 8240.711162 52934.442137 28136.087966 46.567179 0.848601 0.847928 1329.613870 6.474964 5928.728865 9029.520589 1.072764 0.001871 1.071055 0.000170 0.217948 0.058928
2 2 37.708550 8999.536645 8019.572278 62225.529283 32546.711806 371.214859 0.873584 0.876785 979.964367 39.418778 8557.322413 9302.482107 0.984021 0.000000 0.997609 0.000000 0.167944 0.008945
3 3 31.408853 27126.790247 19910.910397 55268.681205 28877.116324 1276.017062 0.693500 0.842343 7215.879850 356.633449 6003.688742 21752.679570 1.339791 0.102475 1.121742 0.001743 0.604501 0.006847
4 4 35.275046 5631.421325 4086.279345 51206.389392 27429.881752 240.357862 0.722748 0.836957 1545.141979 67.417803 4514.513599 5421.788474 1.033183 0.001624 1.008075 0.000000 0.126495 0.008206
5 5 119.593750 9845.125937 8761.566250 65804.531250 4.718750 122.624547 0.877105 0.866666 1083.559688 12.896910 2297.093750 9129.614844 1.074349 0.000000 1.063009 0.000000 0.155053 26.613839
6 6 37.373739 7763.049261 6798.663760 50018.147182 29273.749190 321.251837 0.858854 0.838414 964.385500 38.405960 9822.158003 7762.637698 1.015405 0.000000 1.030711 0.000000 0.177708 0.003721
7 7 35.133054 23799.425950 22531.960944 49134.827499 33421.231044 1069.153340 0.936155 0.854896 1267.465006 51.213529 6987.119262 19597.065459 1.367530 0.086472 1.345614 0.013657 0.582881 0.017647

ANOMALIES or SUSPICIOUS clusters:

The above table gives a comprehensive view of the means, as per the different clusters.

Clustering on the 8 selected features produced 8 clusters; the table above shows the per-cluster means of all numeric features. The following are the key takeaways:

Total Discharges

  1. Clusters 1 and 5 have extremely high values, far from the means of the other clusters. Cluster 1's mean is almost 7 times that of the other clusters, which is a big concern even though the cluster holds only about 3.6% of the total data points.

Average Total Payments

  1. Clusters 3 and 7 have extremely high values. Cluster 3, which holds around 1.7% of the total data points, has a mean of about USD 27k, over 3 times the means of the other clusters. The same holds for Cluster 7.

Average Medicare Payments

  1. Clusters 3 and 7 have extremely high values. Cluster 3 has a mean of about USD 20k, 2-3 times the means of the other clusters. The same holds for Cluster 7.

Average Cost per Procedure

  1. Clusters 3 and 7 have extremely high values. Cluster 3 has a mean of about USD 1.3k, 2-3 times the means of the other clusters. The same holds for Cluster 7.

Total Discharges by Population

  1. Cluster 5 has extremely high values. It holds only about 0.02% of the total data points, yet its mean of about 26.6 dwarfs the means of the other clusters.

Average Payment by Median Income

  1. Clusters 1, 3, and 7 have extremely high values.

Out of Pocket Payment

  1. Cluster 3 has extremely high values.

Median_Score_by_Provider

  1. Clusters 3 and 7 have the highest means of all clusters.

From the above observations, four distinct clusters, 1, 3, 5, and 7, have very high cluster means for many variables and emerge as suspicious cases or anomalies.

Combining these results with the percentages obtained in the previous step, I conclude that clusters 1, 3, and 5 are suspicious and need further investigation.
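Once the suspicious clusters are chosen, the corresponding rows can be marked for manual review; a hedged sketch on illustrative labels:

```python
import pandas as pd

# Illustrative rows with cluster labels from a hypothetical k-means run.
demo = pd.DataFrame({
    'Provider_Id': [10, 11, 12, 13, 14, 15],
    'cluster': [0, 1, 3, 5, 6, 2],
})

# Flag rows belonging to the small, high-mean clusters (1, 3, 5 in this run).
suspicious_clusters = {1, 3, 5}
demo['suspicious'] = demo['cluster'].isin(suspicious_clusters)
int(demo['suspicious'].sum())  # 3 rows flagged
```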

11. Visualizations

First, I will plot pairs of features; this will help to understand the clusters and their distribution better.

For visualization, I have used the following code as reference: https://www.kaggle.com/ellecf/visualizing-multidimensional-clusters

In [130]:
sns.lmplot(x='Average_Total_Payments', y='Median_Score',data=df,hue='cluster',fit_reg=False)
Out[130]:
<seaborn.axisgrid.FacetGrid at 0x7fa76726e7d0>
In [131]:
sns.lmplot(x='Out_of_Pocket_Payment', y='Median_Score',data=df,hue='cluster',fit_reg=False)
Out[131]:
<seaborn.axisgrid.FacetGrid at 0x7fa7672c07d0>
In [132]:
sns.lmplot(x='Medicare_%_Paid', y='Median_Score',data=df,hue='cluster',fit_reg=False)
Out[132]:
<seaborn.axisgrid.FacetGrid at 0x7fa767450850>
In [133]:
sns.lmplot(x='Medicare_%_Paid', y='Average_Total_Payments',data=df,hue='cluster',fit_reg=False)
Out[133]:
<seaborn.axisgrid.FacetGrid at 0x7fa767639dd0>
In [134]:
sns.lmplot(x='Out_of_Pocket_Payment', y='Average_Total_Payments',data=df,hue='cluster',fit_reg=False)
Out[134]:
<seaborn.axisgrid.FacetGrid at 0x7fa767194850>
In [135]:
sns.lmplot(x='Total_Discharges', y='Average_Total_Payments',data=df,hue='cluster',fit_reg=False)
Out[135]:
<seaborn.axisgrid.FacetGrid at 0x7fa767200790>
In [136]:
sns.lmplot(x='Total_Disc_by_Pop', y='Average_Total_Payments',data=df,hue='cluster',fit_reg=False)
Out[136]:
<seaborn.axisgrid.FacetGrid at 0x7fa76a7696d0>
In [137]:
sns.lmplot(x='Total_Discharges', y='Total_Disc_by_Pop',data=df,hue='cluster',fit_reg=False)
Out[137]:
<seaborn.axisgrid.FacetGrid at 0x7fa76b657a10>

Above, I plotted pairs of features and looked for the clusters and specific data points that appear to be anomalies. Each data point is color-coded by the cluster to which it was assigned.

However, we can't see 8 clearly distinct clusters by plotting only two variables at a time; there are 14 numerical variables to consider for the full picture, and we can't plot all of them on one chart. As a next step, we could plot every variable against every other variable.

Check the Cluster-wise visualization of each of the 14 numeric variables:

In [138]:
#This is just to add something constant for the strip/swarm plots' X axis. Can be anything you want it to be.

df['Constant'] = "Data" 
In [139]:
# create a 3x5 grid of empty figures where we will plot our feature plots. We will have a couple empty ones.
f, axes = plt.subplots(3, 5, figsize=(25, 25), sharex = False) 

#Scooch em apart, give em some room
f.subplots_adjust(hspace=0.3, wspace=0.7) 

# In this for loop, I step through every column that I want to plot. This is a 3x5 grid, so I split this up by rows of 5 in the else if statements

for i in range(0,len(list(df[num_variables_all]))):
    col = df[num_variables_all].columns[i]
    if i < 5:
        ax = sns.stripplot(x=df['Constant'],y=df[col].values,hue=df['cluster'],jitter=True,ax=axes[0,(i)])
        ax.set_title(col)
    elif i >= 5 and i<10:
        ax = sns.stripplot(x=df['Constant'],y=df[col].values,hue=df['cluster'],jitter=True,ax=axes[1,(i-5)]) #so if i=6 it is row 1 column 1
        ax.set_title(col)
    elif i >= 10 and i<15:
        ax = sns.stripplot(x=df['Constant'],y=df[col].values,hue=df['cluster'],jitter=True,ax=axes[2,(i-10)])
        ax.set_title(col)

The loop above generates a series of strip plots. Seaborn plots one data point for each row, color-coded by the cluster to which it was assigned; adding jitter fans the points out horizontally. In a strip plot the points can overlap. In a swarm plot (commented out below), the points cannot overlap.

In [140]:
#f, axes = plt.subplots(3, 5, figsize=(25, 25), sharex = False) 
#f.subplots_adjust(hspace=0.3, wspace=0.7)

#for i in range(0,len(list(df[num_variables_all]))):
#    col = df[num_variables_all].columns[i]
#    if i < 5:
#        ax = sns.swarmplot(x=df['Constant'],y=df[col].values,hue=df['cluster'],ax=axes[0,(i)])
#        ax.set_title(col)
#    elif i >= 5 and i<10:
#        ax = sns.swarmplot(x=df['Constant'],y=df[col].values,hue=df['cluster'],ax=axes[1,(i-5)])
#        ax.set_title(col)
#    elif i >= 10 and i<15:
#        ax = sns.swarmplot(x=df['Constant'],y=df[col].values,hue=df['cluster'],ax=axes[2,(i-10)])
#        ax.set_title(col)

Conclusion:

For the entire healthcare fraud and anomaly detection assignment, I created 19 EDAs displaying the various distributions, relations, and characteristics of the variables. I also created 17 new features.

For the k-means, I used 8 as the optimum number of clusters. An important observation:

  • Even though some clusters have very high in-cluster means, they hold a good portion of the total data, so we cannot call them suspicious. Per my analysis, clusters 1, 3, and 5 were suspicious: each held fewer than 5% of the total data points and had high in-cluster means for many variables.
In [ ]: